[TRTLLM-11037][bug] Fix MoE DeepEP hang caused by non-deterministic GC #12060
xxi-nv wants to merge 1 commit into NVIDIA:main
Fix multi-GPU hangs caused by non-deterministic GC of DeepEP buffers by adding explicit destroy() to Communication classes and calling it before fallback to AllGatherReduceScatter. Also add explicit comm resource cleanup in test teardown to prevent inter-test barrier deadlock from lingering DeepEP buffers. Remove the now-unnecessary DEEPEPLOWLATENCY skip on H100 (SM90). Signed-off-by: xxi <xxi@nvidia.com>
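To illustrate why finalizer-driven release is timing-dependent, here is a minimal sketch in plain Python. `FakeBuffer` and `demo` are hypothetical stand-ins, not DeepEP or TensorRT-LLM code, and the behavior shown assumes CPython's reference counting plus cyclic collector. An object caught in a reference cycle is not finalized when its last name is dropped; it waits for the next collector pass, which each rank may reach at a different time. An explicit release runs at a point the caller chooses.

```python
import gc


class FakeBuffer:
    """Hypothetical stand-in for a buffer whose teardown would run a collective barrier."""

    def __init__(self, log):
        self.log = log
        self.released = False
        # A self-reference creates a cycle, so refcounting alone cannot free this object.
        self.cycle = self

    def release(self):
        # Idempotent: safe to call explicitly and again later from __del__.
        if not self.released:
            self.released = True
            self.log.append("released")

    def __del__(self):
        self.release()


def demo():
    """Return (log before GC pass, log after GC pass, log after explicit release)."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the collector from running at an arbitrary point during the demo
    try:
        implicit_log = []
        buf = FakeBuffer(implicit_log)
        del buf  # the cycle keeps the object alive; __del__ has not run yet
        before_collect = list(implicit_log)
        gc.collect()  # only now does the finalizer run
        after_collect = list(implicit_log)

        explicit_log = []
        buf = FakeBuffer(explicit_log)
        buf.release()  # synchronous release at a known point
        after_explicit = list(explicit_log)
        return before_collect, after_collect, after_explicit
    finally:
        if was_enabled:
            gc.enable()
```

In a multi-rank job, the gap between `del buf` and the collector pass differs per process, which is why a collective call inside a finalizer can leave some ranks waiting on others.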
/bot run
Walkthrough
This pull request adds lifecycle management to TensorRT-LLM's fused MoE communication strategies.
PR_Github #38362 [ run ] triggered by Bot. Commit:
PR_Github #38362 [ run ] completed with state
Summary
- Buffer.__del__ calls intranode::barrier (a collective op). Without explicit synchronous release, GC timing differences across ranks cause some ranks to block in the barrier indefinitely
- Add a destroy() method to the Communication base class and the DeepEP/DeepEPLowLatency subclasses for explicit buffer release
- Call destroy() in ConfigurableMoE when falling back from DeepEP to AllGatherReduceScatter
- Make ConfigurableMoE a context manager so resources are released on scope exit

Test plan
- with create_moe(...) as fused_moe)
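The lifecycle described in this summary can be sketched as follows. This is an illustrative outline under stated assumptions, not the code in this PR: the class names mirror those mentioned above, but every method body, the `fall_back` method, and the `buffer` attribute are hypothetical, and no real DeepEP buffer is involved.

```python
class Communication:
    """Hypothetical base class: strategies without resources inherit a no-op destroy()."""

    def destroy(self):
        pass


class DeepEP(Communication):
    def __init__(self):
        self.buffer = object()  # placeholder for a real communication buffer

    def destroy(self):
        # Release explicitly instead of waiting for the finalizer; safe to call twice.
        self.buffer = None


class AllGatherReduceScatter(Communication):
    pass


class ConfigurableMoE:
    def __init__(self, comm):
        self.comm = comm

    def fall_back(self):
        # Release the old strategy's resources before replacing it (assumed method name).
        if self.comm is not None:
            self.comm.destroy()
        self.comm = AllGatherReduceScatter()

    def destroy(self):
        if self.comm is not None:
            self.comm.destroy()
            self.comm = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()  # runs on normal exit and when the body raises
        return False  # do not swallow exceptions
```

A test written as `with ConfigurableMoE(DeepEP()) as fused_moe:` would then release the buffer when the block exits, even if an assertion inside it fails, so no buffer lingers into the next test.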